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Abstract 



Multiple genome alignment remains a challenging problem. Effects of recombination including 
rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of orthology 
even among closely related organisms. We describe a method to align two or more genomes that 
have undergone large-scale recombination, particularly genomes that have undergone substantial 
amounts of gene gain and loss (gene flux). The method utilizes a novel alignment objective score, 
referred to as a sum-of-pairs breakpoint score. We also apply a probabilistic alignment filtering 
method to remove erroneous alignments of unrelated sequences, which are commonly observed in 
other genome alignment methods. 

We describe new metrics for quantifying genome alignment accuracy which measure the quality 
of rearrangement breakpoint predictions and indel predictions. The progressive genome alignment 
algorithm demonstrates vastly improved accuracy over previous approaches in situations where 
genomes have undergone realistic amounts of genome rearrangement, gene gain, loss, and duplica- 
tion. 

We apply the progressive genome alignment algorithm to a set of 23 completely sequenced 
genomes from the genera Escherichia, Shigella, and Salmonella. Analysis of whole-genome multiple 
alignments allows us to extend the previously defined concepts of core- and pan-genomes to include 
not only annotated genes, but also non-coding regions with potential regulatory roles. The 23 
enterobacteria have an estimated core-genome of 2.46Mbp conserved among all taxa and a pan- 
genome of 15.2Mbp. We document substantial population-level variability among these organisms 
driven by homologous recombination, gene gain, and gene loss. Interestingly, much variability 
lies in intergenic regions, suggesting that the Enterobacteriacae may exhibit regulatory divergence. 
In summary, the multiple genome alignments generated by our software provide a platform for 
comparative genomic and population genomic studies. 

Free, open-source software implementing the described genome alignment approach is available 
from http : //gel . ahabs .wis e . edu/mauve^, 



INTRODUCTION 



Multiple genome alignment is among the most basic tools in the comparative genomics toolbox, 



however its application has been hampered by concerns of accuracy and practicality (Kumar and 



Filipski 2007[ Prakash and Tompa 2007 Lunter 2007). Accurate genome alignment represents a 
necessary prerequisite for myriad comparative genomic analyses. 

The definition of a "genome alignment" has been the source of some confusion in compara- 
tive genomics. Genome alignment goes beyond basic identification of homologous subsequences to 
subclassify homology relationships by specific types of evolutionary history (Dewey and Pachter 



2006 



2000 



At a gross level, subtypes of homology include orthology, paralogy, and xenology ( Fitch 
Paralogs can be further classified as in- and out-par alogs, while orthologs can be further dis- 
tinguished as single-copy orthologs (1-to-l orthologs), tandemly repeated orthologs (co-orthologs), 
positional, and general orthologs. For the present work, we are interested in genome alignments 
that distinguish single-copy orthologous and xenologous sites from the remainder of homologous 
and unrelated sites. Downstream application of recombination detection methods can distinguish 
single-copy orthologs from xenologs. 

Early genomic studies on E. coli revealed substantial gene content variability among individual 
E. coli isolates (Perna et al. 2001 Welch et al. 2002 1 . Since then, gene content variability has 



been reported as a common feature in numerous other microbial species ( Tettelin et al. 2005 Hogg 



et al. 


2007 


Hsiao et al. 


2005 iVernikos et al. 


2007 



studies have been limited to gene-based methods by the difficulty of complete and accurate multiple 
genome alignment. 

Approaches to whole-genome alignment typically reduce the alignment search space using an- 



choring heuristics (Ogurtsov et al. 


2002 


iBrudno et al. 


2003a 


Hohl et al. 2002 


Bray and Pachter 


2003 


Blanchette et al. , 2004 1 or banded dynamic programming (Chao et al.| 


1995 ) . Anchoring 



heuristics appear to provide a good tradeoff between speed and sensitivity. Most anchored align- 
ment methods assume that the input sequences are free from genomic rearrangement. As such, a 
separate synteny mapping algorithm must be applied to map collinear homologous segments among 
two or more genomes prior to alignment. Synteny mapping approaches are too numerous to list, 
however most involve computing reciprocal best BLAST hits on putative ORFs, with BLAST hits 
filtered by e-value thresholds, coverage thresholds, and uniqueness criteria. Some synteny mapping 
methods apply genomic context to help resolve ambiguous orthology /paralogy relationships, and 
others use probabilistic transitive homology approaches to infer orthologs among distantly related 



taxa (Li et al. 20031. 



Integrated approaches to synteny mapping and alignment have been proposed, most of which 

Brudno et al 



operate on pairs of genomes (Kurtz et al. 2004 



2003b Vinh et al. , 2006 



et al. , 2006 1 . Research into multiple alignment with rearrangements has been limited, a 



Swidan 

^ ^ though 

some progress has been made ( Darling et al.l, '2004a' |Treangen and Messeguer 2006 Raphael et al.[ 
Ovcharenko et al. 2005 Phuong et al^ ^006 T Apart from greater ease-of-use, integrated 



2004 



synteny mapping and alignment methods have the advantage of providing more accurate inference 
because the alignment can influence the synteny map and vice-versa. 

In the present work, we introduce a novel computational method to construct a multiple align- 
ment of a large number of microbial genomes which have undergone gene flux and rearrangement. 



Unlike our previous alignment method (Darling et al. , 2004a), the new method aligns regions con- 



served among subsets of taxa. Three algorithmic innovations factor strongly in our method's ability 
to align genomes with variable gene content and rearrangement. The first is a novel objective func- 
tion, called a sum-of-pairs breakpoint score, to score possible configurations of alignment anchors 
across multiple genomes. Our second algorithmic contribution is a greedy heuristic to optimize 
a set of anchors under the sum-of-pairs breakpoint score. We demonstrate that most anchored 
alignment techniques suffer a bias leading to erroneous alignment of unrelated sequence in regions 
containing differential gene content. Our final algorithmic contribution is application of a homology 
hidden Markov model (HMM) to reject such erroneous alignments of unrelated sequence. 
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We compare accuracy of existing methods and the new ahgnment method on datasets simu- 
lated to encompass a broad range of genomic mutation types and rates, including inversion, gene 
gain, loss, and duplication. We then apply the multiple genome alignment method to a group of 
23 finished genomes in the family Enterobacteriacae (Supplementary Table [T|. We precisely iden- 
tify the core- and pan- genomes of this group, and report basic analysis ofgene flux patterns in 
Enterobacteriacae. 



METHODS 

An overview of our method as applied to three hypothetical genomes appears in Figure [T] and is 
presently described in detail. 



Notation and assumptions 

Our genome alignment algorithm takes as input a set of G genome sequences gi, §2, ■ ■ ■ Og £ G. 
We denote the length of genome i as \gi\. Contigs in unfinished or multi-chromosome genomes are 
concatenated to form a single coordinate system. Our method computes alignments along a rooted 
guide tree ^ which is computed from G, and we use to denote an arbitrary node in ^. As 
^' is a rooted bifurcating tree, an internal node always has two children, which we refer to as 
^n-Ci and ^n-C2- Furthermore, we define the set of leaf nodes at or below as Each leaf 

node represents a genome from the set of input genomes G. We use set theoretic notation in some 
places, where \ indicates removal of members from a set. 

Various default parameter settings in our software implementation depend on the average length 
of input genome sequences. We refer to the average genome length as AvgSize{G). 



Local multiple alignments as potential anchors 

We identify local multiple alignment s (LM A) as potential anchors using families of palindromic 
sp aced seed patterns ( Darling et al. , [2006 1 in a seed-and-exten d hashing method (see Appendix 
of Darling et al. ( 2004a ) ) . A spaced seed pattern ( Ma et al. 2002 ) identifies the location of fc-mers in 
the input genomes that have identical nucleotide sequence except that a small number of mutations 
are allowed at fixed positions. For example, the seed pattern 11*11*11 would identify matching 
oligomers of length 8 where the 3rd and 6th positions are degenerate. The number of I's in the 
seed pattern is commonly referred to as the weight of the seed pattern, and a pattern is said to be 
palindromic if the pattern is identical when read forward or in reverse. A seed family is a collection 
of seed patterns that when used in conjunction provide improved matching sensitivity, and such 



families have been previously demonstrated to offer excellent speed and sensitivity (Kucherov et al. 



20051. 



To minimize compute time and focus anchoring coverage on single-copy regions, our method 
only extends seeds which are unique in two or more genomes. By default, we use seed patterns 
with weight equal to log2{AvgSize{G)/1.5). The resulting local multiple alignments are ungapped 
and always align a contiguous subsequence of two or more genomes in G. Any given local multiple 
alignment m can be described formally by its length m. Length and a tuple {xi,X2, ■ ■ ■ , xq) where 
Xi is an integer denoting the left-end coordinate of the LMA in genome gi. Genomic coordinates are 
assumed to start at 1, with negative values denoting alignments to the reverse strand, and when Xj 
takes on a value of 0, it indicates that genome gi is not part of the LMA. The LMAs found by our 
procedure are ungapped alignments of unique subsequences and thus are similar to multi-Maximal- 
Unique-Matches (multi-MUMs), but may contain mismatches according to the palindromic seed 
patterns. As with multi-MUMs, any portion of a unique LMA may be non-unique. We refer to the 
complete set of local multiple alignments generated in this step as M. 



Guide tree construction 

We compute a genome-content distance matrix and Neighbor Joining phylogeny using the method 
described in (Henz et al. 20051, adapted to work on local multiple alignments instead of BLAST 
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hits. We midpoint-root the resulting tree to yield our progressive alignment guide tree ^. Steps 2 
and 3 in Figure [T] illustrate guide tree construction. 



Local alignment anchor scoring 

For a given local multiple alignment m which includes genomes gi and gj, we compute a pairwise 



substitution score using a substitution matrix, which defaults to the HOXD matrix (Chiaromonte 



et al. 2002, ). The HOXD matrix appears to discriminate well between homologous and unrelated 



sequence in a variety of organisms, even at high levels of sequence divergence. 

The substitution matrix score quantifies the log-odds ratio that a pair of nucleotides share com- 
mon ancestry, but does not account for the inherent repetitive nature of genomic sequence. Our 
desire to discriminate between alignment anchors that suggest orthology (or xenology) and align- 
ments of regions with random similarity or paralogy requires that we somehow consider repetitive 



genomic sequence in our anchoring score (Lippert et al. 2004). To do so, we compute a quantity 



termed Socc{gi,x), which is the number of occurrences in gi of the spaced seed pattern at site x. 

We then combine the traditional substitution score for a pair of nucleotides with an adjustment 
for the uniqueness of the aligned positions: 



Q{9i,x,gj,y) ii Q{gi,x,gj,y) <0 

s{gi, X, gj,y) = ^ (1) 
sJS^'-yt'cfas^y) - X, gj , y) otherwise 

where Q{gi,x, gj,y) is defined as the substitution matrix score for the nucleotides at sites x of 
genome gi and y of gj . The product of uniqueness scores for the two aligned positions approximates 
the number of possible ways that repetitive sites could be combined. For example, consider a repeat 
element present in both genomes with copy number r j in genome gi and copy number rj in gj . There 
are Virj possible pairs of repeats. When a pair of nucleotides in a repeat element have a positive 
substitution score, the product Socc{gi,x) ■ Socc{gj,y) down-weights the score. 

We refer to the total pairwise anchor score for match m in genomes gi and gj as S{m, gi, gj); 
simply defined as the sum of scores for each alignment column: 

m. Length 

S{m,gi,gj)= ^ s{ gi,pos{m, gi, c), gj,pos{m, gj, c) ) (2) 

c=l 

where pos{m, gi, c) is defined as the position in gi represented by column c of local alignment 
m. In summary, the local alignment scoring scheme assigns high scores to well-conserved regions 
that are unique in each genome and does not consider gap penalties. 

Locally collinear blocks 

A pair of genomes gi and gj may have undergone numerous genomic rearrangements since their most 
recent common ancestor. As such, local alignments among orthologous segments of gi and gj may 
align segments that occur in a different order or orientation in each genome. We define a pairwise 
Locally Collinear Block (LCB) as a subset of local alignments in M that occur in the same order and 
orientation in a pair of genomes gi and gj, i.e. they are free from internal rearrangement. To define 
pairwise LCBs among genomes gi and gj, we first define the projection of the current set of Local 

Multiple Alignments M onto gi and gj as M : gt^, computed by setting all left-end coordinates 
for genomes {gi,gj} \ G to 0. In alignment m, for example, we would compute the projection 
onto gi and gj by setting all values in the left-end coordinate tuple {xi,X2, ■ ■ ■ ,xc) to except 

for Xi and Xj. We can then minimally partition the pairwise alignment projections M : gi ^ into 
locally collinear blocks using the well-known breakpoint analysis procedure (Darling et al. 2004b 



Blanchette et al. 



1997 1 . We refer to the minimal partitioning of M : gi,gj into locally collinear 
blocks as Lc6(M : gi,gj), and the resulting subsets of M : 'gi^ as Ljj ■ ■ ■ Lfj G Lcb(M. : ^TT^j)- 
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(1) Identify ungapped local multiple alignments 

Matches among three genomes (A,B,C) are shown as linked boxes. 
Matches to a genome's reverse strand are shifted downward 




(2) Compute a pairwise distance matrix on 
single-copy gene content/substitutions 

A nucleotide is considered "covered" if it is both contained in a match 
and identical in the other genome. Overlapping matches are trimmed. 
|- trimmed 

rnn ^ 



A 
B 

B 
C 

A 
C 



A: 60% covered 
B: 70% covered 
AB average 65% 



^^^^^ 



"trimmed 




:21c 



B: 62% covered 
C:50% covered 
BC average 56% 



A: 70% covered 
C:70% covered 
AC average 70% 



Pairwise coverage values are substracted from 
one to yield a distance value. The matrix 
is used to infer a guide tree and to scale 
the breakpoint penalty during anchoring. 



0.35 




0.30 
0.44 






(5) Recursive anchor search 

Genome A 

For each pair of genomes 
regions between anchors 
(white) and regions outside 
LCBs (yellow) are searched 
for new anchors. Newly 
found anchors are shown in 
purple. We now return to 
step 4 (not shown). 



(6) Perform global anchored alignment 

The current set of anchors are used for profile-profile alignment. 
Inter-anchor regions shown in white and yellow above are aligned. 

(7) Evaluate whether all genomes are aligned 

Genome B remains unaligned, so return to step 4. 



(4.ii) Anchor nearest unaligned sequences 
in guide tree 

A and C are already aligned, so we must align B to D,and use 
BA and BC anchors to do so. We enforce consistency among 
anchors and apply greedy breakpoint elimination. 



2 3 \, 



B 
B 



\ 5 
< 



(3) Infer anchoring guide tree on gene content/ 
substition distances 

Use Neighbor-Joining, apply midpoint rooting 



-0 c 



-• A 



B 



(4) Anchor nearest unaligned sequences 
in guide tree 

A and C are closest. Overlapping matches are trimmed, then 
we apply sum-of-pairs greedy breakpoint elimination to select 
anchors (see text). Breakpoints are shown as arrows. Matches 
in gray were removed, eliminating the breakpoints in gray and 
improving the Sum-of-pairs anchor score. If the anchor score 
has not improved after an iteration of step 5, we skip to step 6. 
\ \ 

A 




Breakpoints are shown as red arrows, and matches are 
assigned numeric identifiers. The pairwise LCBs among A,B are: 
{1 ,2,3}, {4}, {5}. Pairwise LCBs among B,C are: {1 },{3},{4},{5}. The 
effect of removing matches in each LCB is evaluated with the 
Sum-of-pairs anchor score, and matches whose removal yields 
the largest improvement are removed. In this case, the score 
can not be improved so we continue to step 5.ii. 

(5.ii) Recursive anchor search 

As above, but among the pairs BA and BC 

(6.ii) Perform anchored alignment with MUSCLE 

As above, but among B and the existing AC alignent. 

(7.ii) Evaluate whether all genomes are aligned 

All genomes are now aligned, continue to step 8. 

(8) Reject alignments of unrelated sequence 
with a Homology HMM 

Use the method ofTreangen etal 2008 to eliminate gap dribble. 



Figure 1: Overview of the alignment algorithm using three example genomes A, B, and C. 
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Sum-of-pairs anchoring objective function 

Our method computes anchored aUgnments progressively according to a guide tree. When com- 
puting an alignment at a given internal node of the guide tree we optimize the following 
sum-of-pairs LCB anchoring objective function to select a set of alignment anchors: 



-b\Lcb{M : gCgj)\ + ^ S{m,gi,gj) 



SPAnchorScore{'i/n,M) = ^ ^ 

^ (3) 
where |Lc6(M : gi,gj)\ refers to the total number of LCBs in the set of local alignments among 
genomes gi and gj. The coefficient 6 is a breakpoint penalty, which when multiplied by the number 

of LCBs |Lc6(M : giC^j)], creates a scoring penalty that increases in magnitude when the anchors 

in M : gi'^ induce a larger number of LCBs. The value of the breakpoint penalty 6 is a user- 
controlled parameter in our implementation of the algorithm, and we use a default value of 30,000 
as manual experimentation on real genome sequence data suggests this value represents a good 
tradeoff between sensitivity to small genomic rearrangements and filtration of spurious alignments. 
Thus, anchor sets inducing fewer breakpoints are given higher scores. 

Optimizing the SP anchoring objective function: greedy breakpoint elimination 

We apply a greedy breakpoint elimination heuristic to optimize SP Anchor S cor e{'^n,^) which 
removes potential anchors from M until the score can no longer be increased. Specifically, if all local 
alignments comprising an LCB Lij are removed from M, then the number of LCBs in Lc6[Mj,) 
will decrease by at least one and at most four if neighboring LCBs coalesce (Darling et al. 2004a[). 



The decreased number of LCBs, and hence breakpoints, reduces the total breakpoint penalty in 
S P Anchors cor ei^n-, M). 

Our algorithm repeatedly identifies the LCB whose removal from M would provide the largest 
increase to SP Anchor Scot e{'^n^^) and removes the local alignments which comprise that LCB 
from M. This procedure corresponds to step 4 in Figure [Tj Formally, we identify Ljj- which 
satisfies: 

max < max SPAnchorScorei'^n,^'^ Li j)\ (4) 

yL^,j£Lcb{M.:gi,gj) ) 

and remove the local alignments in Lij from M. The greedy breakpoint elimination process repeats 
until further removal of LCBs fails to improve the SP anchoring score. 

Recursive anchoring 

The initial set of local alignments in M is typically computed using a seed weight that finds 
local alignments in unique regions of high sequence identity. As such, the initial set of anchors 
frequently misses homologous regions with lower sequence identity. After anchor selection by greedy 
breakpoint elimination (Equation E|, our method searches for additional local alignments between 
anchors existing among all pairs of genomes in £(^'„.ci) and jC{^n-C2), see Figure [T] step 5. To 
improve sensitivity during recursive anchor search, smaller seed weights are used as described 



by Darling et al. (2004a I . Any new local alignments are added to M. After the recursive anchor 
search, we apply greedy breakpoint elimination to optimize the SP anchor score once again. The 
recursive anchoring process repeats until SPAnchorScore{^n,^) no longer improves. 

Anchored profile alignment and iterative refinement 

Each LCB at contains alignment anchors, which we use to perform an anchored profile-profile 



global alignment using modified MUSCLE software (Edgar 2004 1. After the initial profile-profile 
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alignment, we apply window-based iterative refinement to improve the alignment. Step 6 of Figure [T] 
corresponds to this process. Importantly, MUSCLE attempts to refine the alignment on a multitude 
of alternative guide trees and is not restricted to the guide tree chosen for progressive anchoring. 
The use of multiple guide trees is a particularly important feature in microbial genomes, which 
are subject to extensive lateral gene transfer. It should be noted that our use of MUSCLE as a 
refinement step is an approach used in other software pipelines as well (Margulies et al. 2007). 



Rejecting alignments of unrelated sequences 

Segments of DNA between high-scoring alignment anchors are often unrelated, especially in bac- 
teria. Despite that, our method (like most other genome aligners) applies a global alignment 
algorithm to all inter- anchor segments, naively assuming that homology exists. Our assumption of 
homology often proves erroneous, so to arrive at an accurate alignment we must detect forced align- 
ment of unrelated sequence. To do so, we apply an HMM posterior decoder that classifies columns 
in a pairwise alignment as either homologous or unrelated. The HMM structure, transition, and 
emission probabilities are described elsewhere (Treangen et al. , 2008 1 . The HMM makes predictions 



of pairwise homology, which we combine using transitive homology relationships. Regions found to 
be unrelated are removed from the final alignment. Application of the homology HMM is the final 
step in the alignment procedure, shown as step 8 in Figure [T| 



Implementation 

The alignment algorithm has been implemented in the progressiveMauve program included with 
Mauve v2.0 and later. The program is open source C-|— |- code (GPL), with 32- and 64- bit binaries 
for Windows, Linux, and Mac OS X available from http://gel.ahabs.wisc.edu/mauve, Time- 



consuming portions of the algorithm have been parallelized using OpenMP 2.5 ( Chandra et al.^ 
2001 1. An accessory visualization program is include d (see F igure [7| . Default alignment parameters 
have been calibrated for bacterial genomes ( Darling 2006 1 . 



RESULTS 

Quantifying alignment accuracy 

Our new alignment algorithm uses approximations and computational heuristics to compute align- 
ments. To understand the quality of alignments produced by our approach it is essential to ob- 
jectively quantify alignment accuracy. Without a known 'correct' genome alignment, automated 
alignment heuristics can not be evaluated for accuracy. Although several benchmark data sets 
exist for protein sequence alignment (Thompson et al. 1999 Edgar 2004 1, no such benchmark 



data sets exist for genome alignment with rearrangement. Thus far, manual curation of a whole- 
genome multiple alignment that includes rearrangement and lateral gene transfer has proven too 
time-consuming and difficult. Despite the lack of a manually curated correct alignment, we can es- 
timate the alignment accuracy by modeling evolution and aligning simulated data sets. All results 
described in this section and the programs used to generate them are available as supplementary 
material. 



Simulated evolution model 

In previous work, we constructed a genome evolution simulator that captures the major types. 



patterns, and frequencies of mutation events in the genomes of Enterobacteriacae (Darling et al. 



2004a). We use the same simulated model of evolution in the present study but with different 
evolutionary parameters. Given a rooted phylogenetic tree and an ancestral sequence we generate 
evolved sequences for each internal and leaf node of the tree, along with a multiple sequence 
alignment of regions conserved throughout the simulated evolution. Along the branches, mutations 
such as nucleotide substitution, indels, gene gain/loss, and inversion rearrangements are modeled as 
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a marked Poisson process. We score calculated alignments against the correct alignments generated 
during the evolution process. 

Although gene duplication does occur in bacteria, we do not explicitly model it here as dupli- 



cations tend to be unstable in bacterial chromosomes and are often counterselected (Achaz et al 



20031. Instead, we indirectly account for gene duplications in two ways. First, the source DNA 
sequence for gene gain events comes from a IMbp pool of sequence. At moderate to high simulated 
rates of gene gain, many megabases of DNA are sampled from the donor pool, and as a result, 
identical donor sequence gets inserted into the simulated genomes in multiple places. The effect is 
similar to a dispersed repeat family, such as bacterial IS elements or mammalian SINE elements. 

Second, we use the genome sequence of E. coli 0157:117 as ancestral sequence and as donor 
sequence for all insertion and gene gain events. The E. coli 0157:117 genome has numerous natu- 
rally occurring repeats that are carried on to simulated descendant genomes. By using real genome 
sequence as ancestral sequence, the resulting evolved genomes often have similar nucleotide, din- 
ucleotide, A;-mer composition, repeat copy number and repeat distribution. The unknown natural 
forces governing the evolution of such traits would otherwise be extremely difficult to capture in a 
simulation environment. 

Our experimental results at high mutation rates should be interpreted with caution, however, 
since the more simulated mutations applied, the less a simulated genome will look like a real genome. 

Accuracy evaluation metrics 

Previous studies of alignment accuracy have used a sum-of-pairs scoring scheme to characterize the 



residue level accuracy of the aligner (Thompson et al. 1999 



Darling et al. , 2004a). The experiments 



presented here use sum-of-pairs scoring, but we also define new accuracy measures to quantify each 
alignment system's ability to predict indels and breakpoints of genomic rearrangement. For each 
type of mutation, we define True Positive (TP), False Positive (FP), and False Negative (FN) 
predictions as discussed below. Using these definitions, we can measure the aligner's Sensitivity as 
TP+FN ^^"-^ Positive Predictive Value (PPV) as ^ppqrpp. 

For nucleotide pairs, a TP is a pair aligned in both the calculated and correct alignments. A FP 
is a nucleotide pair in the calculated alignment that is absent from the correct alignment. Likewise, 
a FN is a pair in the correct alignment not present in the calculated alignment. We do not quantify 
True Negative (TN) alignments as the number of TN possibilities is extremely large, growing with 
the product of sequence lengths. 

We classify each indel in the correct alignment as a TP or a FN based on the predicted alignment. 
A true positive indel has at least one correctly aligned nucleotide pair in the diagonal/block on 
either side of the indel and at least one nucleotide correctly aligned to a gap within the indel 
(see Figure [2]). The number of TP indels will never exceed the number of indels in the correct 
alignment. We define FP indel predictions as the number of excess indel predictions beyond the 
true positives. FN indels lack a correctly predicted nucleotide pair in the flanking diagonals/blocks 
or lack predictions of gaps in the correct gapped region. Figure [2] gives examples of each case. 



Aligners are notoriously bad at predicting the exact position of indels (Lunter et al. , 2008). 
Under our definition, a TP indel prediction need not predict the exact boundaries of an indel, 
merely the existence of an indel. This scheme allows us to distinguish cases of missing indel 
predictions from cases where the indel was predicted but not positioned correctly. We quantify 
indel boundary prediction accuracy as the distance between the true boundary and the nearest 
aligned nucleotide pair in the diagonal/blocks which flank the predicted indel. When the predicted 
indel is too large, our metric assigns a positive value to the boundary score. When the predicted 
indel is too small, a negative value is assigned. 

Large indels have historically caused problems for nucleotide aligners, which have a tendency 
to break up large indels into a string of smaller gaps with intermittent aligned sequence. Under 
our definition, a large indel can still be considered as a TP prediction even if it is broken into a 
string of smaller gaps by the aligner (See Figure [2] prediction A for an example). Our rationale is 
that the aligner did correctly predict the presence of unrelated sequence, for which it garners a TP, 
but erroneously predicts additional transitions to and from homology, which are classified as FP 
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Correct alignment: 




□ 



flanking blocks in correct alignment 



5 6 
A C 



7 8 9 

G T A C 

T A C 

7 8 9 



1 TP, 1 FP, FN, Boundaries 0,0 
OSingularTP 



12 3 
A C G 



4 5 
T A 



8 9 

T A 

T A 

7 8 



ITR 2FR OFN, Boundaries +1,-1 
OSingularTP 



1 234567890 
A CGTACGTAC 
ACTGCAT AC 
1234567 89 

OTP 2FP 1 FN, Boundaries?,? 
OSingularTP 



Prediction D: 

1 2 3 4 5 6 7 

AC G T A C G 



A C T G 
12 3 4 
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C G T A C 

C A T A C 

5 6 7 8 9 



ITP OFP OFN, Boundaries 0,-2 
1 SingularTP 



Figure 2: Quantifying indel accuracy. The correct alignment is shown at left and four possible 
predicted alignments are shown as A, B, C, and D. Nucleotides have been assigned a numerical 
identifier. The correct alignment has a single indel which partitions the alignment into three 
sections: the left aligned block, the indel, and the right aligned block. Predicted alignments must 
have one correctly aligned nucleotide pair in each of the three sections to count a true positive indel 
prediction. 



indel predictions. To distinguish whether a TP indel was broken into two or more smaller gaps, we 
define a class of "singular" TP indel predictions as indels that were predicted as a single alignment 
gap. See Figure [2] prediction D for an example of a "singular" TP indel. 

For each pair of genomes we also measure whether the aligner correctly predicts LCBs among 
that pair, yielding a sum-of-pairs LCB accuracy metric. For each pairwise LCB in the true align- 
ment, we record a TP LCB prediction when the predicted alignment contains at least one correctly 
aligned nucleotide pair in that LCB. Pairwise LCBs lacking any correctly predicted nucleotide pairs 
are FN predictions. Finally, pairwise LCBs in the predicted alignment lacking any correctly aligned 
nucleotide pairs are False Positive (FP). Again, we do not measure TN. 

As with indels, we define a separate metric to quantify how well each aligner localizes the exact 
breakpoints of rearrangement. For TP LCB predictions, we record the difference (in nucleotides) 
between the boundaries of the correct LCB and those of the predicted LCB. The resulting value is 
negative when the predicted LCB fails to include the full region of homology, and positive when a 
predicted LCB extends beyond the true boundary. 

Under our definitions of TP, FP, TN, and FN predictions, specificity, which is commonly defined 
as ppq^^) is not a useful metric. The extremely large values taken on by TN would drive the 
quotient to 1 in most cases. 



Accuracy on collinear genomes 

Our first experiment compares the accuracy of Mauve 1.3.0, Progressive Mauve, MLAGAN 2.0, 
MAVID 2.0, and TBA 28-02-2006 when aligning collinear sequences that have undergone increasing 
amounts of nucleotide substitution and indels. For each combination of indel and substitution rate, 



nine genomes are evolved from a IMbp ancestor according to a previously inferred phylogeny (Dar- 
ling et al. 2004a I shown in Figure |4j We then construct alignments of evolved sequences using each 



aligner with default parameters, and quantify sensitivity and positive predictive value, (PPV) for 
nucleotide pair and indel predictions. Three replicates were performed, results shown in Figure |3j 
In general, all aligners perform well on collinear sequence, except for Mauve 1.3.0 which is un- 
able to anchor genomes with high mutation rates. Of the tested aligners, TBA offers the highest 
nucleotide sensitivity, and Progressive Mauve gives the best indel sensitivity and positive predictive 
value in most cases. Despite that, all aligners are quite bad at predicting indels accurately, which 
may be in part due to an inherent loss-of-information introduced during the course of simulated 



evolution (Lunter et al. 20081). We did not test the Pecan aligner here, although a detailed evalu- 



ation of its performance can be found elsewhere (Margulies et al. 2007) and we do perform some 
testing on it below. 



Accuracy in the face of rearrangement and gene flux 

We assessed the relative performance of Mauve 1.3.0, Progressive Mauve, and TBA (Blanchette 



et al. 2004) when aligning genomes with high rates of genomic rearrangement, gene flux, and 



nucleotide substitution. Although the original TBA manuscript did not fully describe alignment 
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Mauve 1.3.0 MLAGAN 2.0 MAVID 2.0 TBA (28-02-2006) Progressive Mauve 




0.4 0.8 0.4 0.8 0.4 0.8 0.4 0.8 0.4 0.8 

Substitutions per site among most distant taxa 



Figure 3: The accuracy of aligners on sequences evolved without rearrangement and with increasing 
nucleotide substitution and indel rates. Aligners were tested on 100 combinations of indel and 
substitution rate, with performance averaged over three replicates. All methods lose accuracy as 
mutation rates grow, and the most accurate alignment method depends on the particular mutation 
rates. Progressive Mauve and MLAGAN exhibit the best indel sensitivity and positive predictive 
value (PPV), while TBA is more sensitive than other methods at extremely high mutation rates. 
MLAGAN did not align genomes without indels within the allotted 10 hours, resulting in the 
black row at the bottom. The asterisk in this figure indicates the combination of indel rate and 
substitution rate expected to be similar to our 23 target genomes. 
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Nt Sensitivity Nt PPV Indel Sensitivity Indel PPV LCB Sensitivity LCB PPV 




Nucleotide substitutions per site among most distant taxa Simulation tree topology 



Nt Sensitivity NtPPV Indel Sensitivity Indel PPV LCB Sensitivity LCB PPV 




Small gene flux events among most distant taxa (thousands) 



Figure 4: Accuracy of Mauve, Progressive Mauve, and TBA when aligning genomes with inversions 
and gene flux. In the experiments shown at top, the inversion rate increases along the y-axis and 
the substitution rate along the rr-axis. The most distant taxa have 0.05 indels per site. Progressive 
Mauve clearly outperforms Mauve 1.3.0 and TBA over the entire space of inversion rates. It should 
be noted that in the UCSC browser alignments TBA is used in conjuction with a separate synteny- 
mapping method to identify rearrangements (Miller et al. 20071, so the performance results given 
here are not cause for alarm. Experiments at bottom quantify aligner performance in the presence 
of small- and large-scale gene flux (gain/loss) events. The y-axis gives the average number of large 
gene flux events [length~Unif(10kbp, 50kbp)] between the most distant taxa, while the x-axis gives 
small gene flux events [length~Geo(200bp)]. Substitution and indel rates are those indicated by 
the asterisk in Figure [Sj and the most distant taxa have 42 inversions on average. The asterisk in 
this flgure indicates a simulation scenario expected to be similar to our 23 target genomes. Once 
again Progressive Mauve outperforms other methods, but all methods break down when faced with 
substantial large-scale gene flux. Of note, when Mauve 1.3.0 attains high PPV it usually does so 
with very poor sensitivity. 
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with genomic rearrangement, the most recent release (dated 28-02-2006) handles it ( Raphael et al.^ 
2004[ Ovcharenko et al. , 20051. For our first set of experiments, shown in the top half of Figure |4| we 
simulated evolution at 100 combinations of substitution and inversion rate. In addition to nucleotide 
and indel accuracy, we also quantify LCB accuracy on this data set. The results indicate that 
Progressive Mauve can accurately align genomes with substantially higher rates of rearrangement 
than other approaches. Although TBA exhibits lackluster performance in comparison to Progressive 
Mauve, comparison with the results for MAVID 2.0 and MLAGAN 2.0 (shown in Supplementary 
Figure [s]) demonstrates that for all rates of inversion, TBA produces much better alignments than 
methods which assume genomes are free from rearrangement. 

For the second set of experiments we simulated genomes with 10 increasing rates of small-scale 
gene flux and 10 increasing rates of large-scale gene flux. Small gene flux events are geometrically 
size distributed with mean 200bp, while large gene flux events have uniform lengths between lOkbp 



and 50kbp. These sizes were chosen to match empirically derived estimates (Darling et al. , 2004a). 
The results, shown in Figure |4j indicate that Mauve 1.3.0 falters when faced with large-scale gene 
flux, while Progressive Mauve and TBA perform significantly better. As gene flux rates increase 
in our model, the amount of orthologous sequence shared among genomes deteriorates, eventually 
reaching zero in the limit of infinitely high gene flux rates. 



Gap dribble and the quality of long gap predictions 

Gene gain and loss events manifest themselves in genome alignments as long gaps. Every predicted 
alignment gap implies at least one insertion or deletion of nucleotides has taken place in the history 
of the organisms under study. Since we would like to quantify the contribution of gene flux to the 
target genomes, it is imperative that predicted alignment gaps be as accurate as possible. 

Current sequence alignment methods typically score pairwise alignments with an affine gap 
scoring scheme consisting of a gap open penalty and a gap extend penalty. In a probabilistic setting, 
the optimal affine-gap alignment corresponds to a viterbi path alignment from a pair-HMM with a 
single pair of insert and delete states (Durbin et al. 1998). However, when aligning genomes which 



have undergone a significant amount of gene gain and loss, an excess of large gaps e xists that doe s 
not fit the gap size distribution imposed by a standard global alignment pair-HMM (Lunter 2007). 



The net result is that under the affine gap model, aligners tend to break up large gaps into a series 
of small gaps interspersed with short stretches of improperly aligned nucleotides. In the spirit of 
classifying systematic alignment errors introduced by Lunter et al. (2008), we refer to this problem 
as gap dribble, since short alignments are dribbled along the large gap. The large number of small 
gaps creates problems when trying to reconstruct the history of gene gain and loss events, since 
they imply a much greater number of insertions and deletions than actually occurred. 

Using our simulated evolution platform, we quantify the performance of each aligner in pre- 
dicting gaps of varying size. We simulated evolution of collinear genomes (no rearrangement) that 
have undergone a realistic amount of gene gain and loss, corresponding to previous estimates for 
the rates of these events in the enterobacteria |Darling et al.| (2004a) . Nucleotide substitutions and 
indels were modeled to occur at the rate indicated by the blue asterisk in Figure |3] and gene gain 
and loss events were modeled to occur with twice the frequency indicated by the blue asterisk in 
Figure |4j Figure |5] Left gives the observed size distribution of gaps. 

We then applied each aligner to the simulated genomes and measured the accuracy of gap 
predictions as a function of gap size. The ahgners Mauve 1.3.0, MAVID 2.0, MLAGAN 2.0, TBA 
28-02-2006, Progressive Mauve, and Pecan vO.7 were tested. Pecan vO.7 is a new aligner that has 



been demonstrated to have excellent performance (Margulies et al. , 2007 Paten et al. 20081 by 



virtue of using probabilistic consistency during the anchoring process. Moreover, Pecan vO.7 uses 
a pair-HMM with an extra gap state specifically designed to model long indels. The reconstructed 
alignments were scored against the true alignments and results for ten replicates were recorded. 

The right side of Figure [5] shows the quality of each aligner's indel predictions as a function 
of the true gap size. Shown is the frequency with which gaps of a particular size are predicted 
as a single gap (singular TP) instead of a string of smaller gaps with interspersed alignments of 
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Figure 5: Left Average size distribution of gaps in an alignment of the nine genomes evolved at 
mutation rates which correspond to previous estimates for the E. coli, Shigella, and Salmonella. The 
gap size distribution was averaged over 10 simulations. Right Fraction of TP indel predictions 
that are singular TP indel predictions by true gap size. Ten replicate simulations of evolution 
with gene gain, gene loss, indels, and nucleotide substitution were performed and alignments were 
computed using each aligner. Predicted indels were classified according to the definitions given 
in Figure [2| namely, a singular True Positive implies the true gap is predicted as a single gap. 
Remaining True Positive indels have the true gap broken up into two or more predicted gaps. For 
each aligner, the fraction of singular predicted gaps is shown as a function of gap size. All aligners 
do well in predicting small gaps, but large gaps present problems. Most aligners, including Pecan 
which uses an extra pair-HMM state to model long gaps, tend to predict long gaps as a series of 
short gaps interspersed with alignments of unrelated sequence. We refer to such behavior as "gap 
dribble." Progressive Mauve was run with default parameters (proMauve), without the Homology 
HMM (proMauve_no_HMM), with the option to assume genomes are coUinear (proMauve_col) , and 
finally assuming collinearity and without the HMM (proMauve_coLno JHMM). 
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non-homologous sequence (nonsingular TP). From the figure, it is obvious that aligners which use 
an affine gap penalty tend to perform poorly in predicting large gaps. Somewhat surprisingly, the 
pair-HMM with an extra gap state used by Pecan to model long indels still yields poor predictions 
of long gaps, although sensitivity is quite good (not shown). Progressive Mauve appears to perform 
well at all gap sizes, especially when the aligner is told to explicitly assume the genomes are collinear 
(proMauve.col) . To determine whether Progressive Mauve's performance results from its anchor- 
ing algorithm or use of the Homology HMM to reject alignments of unrelated sequence, we also 
tested Progressive Mauve without the Homology HMM, shown as panels proMauvejio_HMM and 
proMauve_col_no_HMM. Without the Homology HMM Progressive Mauve yields inferior results, 
indicating that the Homology HMM does indeed address the problem of gap dribble. 



DISCUSSION 



Progressive Mauve alignments enable a wide variety of downstream research. Here we illustrate 
some applications with an alignment of 23 complete E. coli, Shigella and Salmonella genomes. 
The alignment can be used to characterize the shared (core) and total (pan-genome) amount of 
sequence found in these species. In related work, we demonstrate a method to statistically infer 
the phylogenetic history of gene flux and identify sudden changes in the overall rate of gene flux in 
ancestral lineages (see Didelot et al joint submission with this manuscript). The alignment can also 
be used to extract variable sites for more traditional phylogenetic analyses. Finally, we demonstrate 
that Progressive Mauve identifies both conserved regulatory regions and hypervariable intergenic 
regions. 

When applied to twenty-three E. coli, Shigella, and Salmonella genomes, the resulting alignment 
reveals a core genome of 2,675 segments conserved among all taxa, which account for an average 
of 2.46 Mbp of each genome. Between the core segments lie regions conserved among subsets of 
taxa and regions unique to individual genomes. By counting each core, unique, and subset segment 
exactly once, one constructs a total pan-genome that includes genes and intergenic regions alike. 
The 23 genomes have a pan-genome of 15.2 Mbp, approximately three times that of a single strain, 
indicating a tremendous degree of variability in both genie and intergenic content. 

Shigella spp. are widely recognized as E. coli based phylogenetic analyses (Pupo et al. 20001 
and genome comparisons (Yang et al. , 2007| ), though the original phenotypically derived taxonomy 
persists. We will refer to them collectively as E. coli/Shigella. Similarly, taxonomic revisions 
of Salmonella, have collapsed almost all strains into a single species: S. enterica. Thus, we are 
examining the structure of the pan and core genomes of two sister species, E. coli/Shigella and 
Salmonella. The 16 E. coli/Shigella strains have a pan-genome of 12.5 Mbp and core of 2.9 Mbp, 
while the seven S. enterica serovars have a pan-genome of 5.8 Mbp and a core genome of 4.1 Mbp. 
The intersection of the core genomes is the joint core, while the union of the pan-genomes is the 
combined pan-genome, shown in Figure [6} Note that the intersection of pan-genomes is 580 kb 
larger than the joint core. This counter-intuitive situation arises when components of the core- 
genome of one group are found in some, but not all members of the other species. In this instance, 
220 kb can be attributed to losses of genes in Shigella strains that are otherwise conserved among 
all E. coli and Salmonella. A more detailed dissection of the patterns of gene flux in the E. 
coli/Shigella lineage appears in the joint Didelot et al submission. 



Inference of genome rearrangement history 

Progressive Mauve alignments also make an excellent starting point for analysis of genome rear- 
rangement patterns. Genome rearrangement is known to occur via a multitude of mutational forces, 
including inversion, transposition, and duplication/loss, and is especially prominent in bacterial 
pathogens. Methods already exist to infer inversion histories among pairs of genomes (Hannenhalli 
and Pevzner 1995 Tannier and Sagot] 2004) and multiple genomes ( Tang and Moret, ,2003, jLarget 
et al. 2002 1 . More general models to account for multiple chromosomes and multi-break rearrange- 
ments have also been developed (Yancopoulos et al. 2005 , Bergeron et al. 2006 Alekseyev 2008), 



although not yet in either phylogenetic or Bayesian contexts. 
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Figure 6: Venn diagram of the pan-genome (left) and core genome (right) of E. coli/Shigella and 
5*. enterica. 



Most genome rearrangement history inference methods do not also infer gene gain and loss, 
but instead assume that gene content across genomes is equal. When gene content is nearly equal, 
current models can use a multiple genome alignment to infer patterns of selection on genome 
rearrangement (Darling et al. 2008j). However, equal gene content has proved to be the exception 
rather than the rule. Despite that, a Progressive Mauve alignment with differential content can 
be trivially reduced to contain only segments conserved among all taxa of interest, yielding a 
signed gene-order permutation matrix that is suitable for current genome rearrangement inference 
software. 



Alignment visualization 

Genome alignments are large and complex entities that are not usually suitable for direct interpreta- 
tion. Genome comparison browsers such as the UCSC browser (Miller et al. 2007), VISTA (Mayor 



et al. 20001, and others have proven invaluable as tools to facilitate understanding of whole genome 



alignments. To aid in use of Progressive Mauve alignments, we have developed an interactive visu- 
alization program that can present a complex alignment in a meaningful and easily understandable 
visual paradigm. 

The visualization system illustrates three major aspects of genome evolution: genome rear- 
rangement, patterns of segmental gain and loss, and selective pressures for or against regional 
conservation of nucleotide sequences. Figure [7] illustrates two of these aspects in a visualization of 
the 23-way alignment of E. coli, Shigella, and Salmonella. 

Figure [7] shows the region surrounding the yhjE gene, which encodes a product in the Major 
Facilitator Superfamily of transporters. yhjE is flanked by yhjD to the left, and yhjG to the right. 
The intergenic regions between these three genes are hypervariable (as indicated by the variety of 
colors in Figure and have been subject to multiple insertion and deletion events. The hyper- 
variable nature of the regions surrounding yhjE may not be surprising, because it harbors a REP 
element to the left, and a RIP element to the right. REP elements contain a series of two or more 
35- bp palindromic repeats and are known for a variety of functions, including the binding of DNA 
gyrase and Poll, and as mRNA anti-decay hairpins or rho-dependent attenuators. RIP elements 
are a specialized form of REP elements that contain an IHF binding site, and this particular RIP 
also contains a REPt transcription terminator sequence. IHF is a global transcriptional regulator 
in E. coli. 

Interestingly, the patterns of insertion/deletion in the intergenic regions surrounding yhjE do 
not follow the expected taxonomic patterns, suggesting instead that recombination among strains 
has taken place. The RIP region present to the right of yhjE in most E. coli has been replaced 
with an unrelated sequence in E. coli E23477A and S. boydii (shown as turquoise in Figure [7|, but 
not in S. sonnei. Those three strains form a clade in the E. coli/Shigella taxonomy (Didelot et al 



2008) with E. coli E23477A branching first, so convergent evolution must have occurred here. 



The pattern of intergenic variability surrounding yhjE suggests potential regulatory divergence, 
a much studied evolutionary mechanism in eukaryotes largely overlooked in prokaryotic research. 
The yhjE locus is by no means the only region harboring intergenic variablity; a screen of the 23- 
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Figure 7: A Mauve visualization of the hypervariable intergenic regions surrounding yhdE. Each 
genome is laid out in a horizontal track, with annotated coding regions shown as white boxes. A 
colored similarity plot is shown for each genome, the height of which is proportional to the level of 
sequence identity in that region. When the similarity plot points downward it indicates an align- 
ment to the reverse strand of the genome. Colors in the similarity plot indicate the combination 
of organisms containing a particular segment of the genome. Segments colored pink/mauve are 
conserved among all organisms, while purple segments are conserved in everything but Salmonella, 
and segments colored in olive green are conserved among non-uropathogenic E. coli. The visual- 
ization system is interactive and written in Java, and works on all computers supporting Java 1.4 
or later. 



way alignment identifies 102 other strictly intergenic regions with similarly variable conservation 
patterns. 



Conclusions 

We have presented a novel multiple genome alignment heuristic which demonstrates a substantial 
accuracy improvement on simulated datasets. Key features of the approach are an anchor scoring 
function that penalizes alignment anchoring in repetitive regions of the genome and penalizes 
genomic rearrangement. Use of a Sum-of-pairs approach enables robust scoring of genomes that 
have undergone both gene flux and rearrangement — a scenario not addressed by previous alignment 
methods. 

Future efforts to improve genome alignment may explicitly incorporate models of evolution- 
ary distance into alignment scoring process (Fu et al. 2007 1 . Multiple alignment methods based 



on probabilistic consistency have demonstrated great promise in the context of amino-acid align- 
ment (Do et al. 2005) and aligning collinear genomic regions (Paten et al. 2008), and in principle. 



could be also extended to genome alignment with rearrangement. 

No method reconstructs error-free genome alignments, and any particular alignment is likely to 
contain errors that can substantially influence downstream inference. However, methods to estimate 



the confidence in aligned columns are under continuing development (Lunter et al. 2008). Down 



stream inference methods that can explicitly cope with the inherent uncertainty in reconstructed 
aligimients will be crucial for continued advances in comparative genomics. 
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APPENDIX 



Organism 


Genome size w/Plasmids 


Accession 


E. coll K-12 MG1655 


4,654,221 


U00096 


E. coli K-12 W3110 


4,646,332 


AP009048 


E. coli HS 


4,643,538 


AAJYOOOOOOOO 


E. coli 0157:H7 EDL933 


5,623,806 


AE005174 


E. coli 0157:H7 Sakai 


5,594,477 


BA000007 


E. coli E24377A 


4,980,187 


AAJZOOOOOOOO 


E. coli CFT073 (UPEC) 


5,231,428 


AE014075 


E. coli UT189 (UPEC) 


5,179,971 


CP000243 


E. coll APEC 01 


5,082,025 


CP000468 


E. coli 536 (UPEC) 


4,938,920 


CP000247 


Shigella boydii Sb227 


4,646,520 


CP000036 


Shigella flexneri 2a 2457T 


4,988,914 


AE014073 


Shigella flexneri 2a 301 


4,828,821 


AE005674 


Shigella flexneri 5 8401 


4.574.284 


CP0002G6 


SliujcUa ihjHciitcfuu' S(lli)7 


4..>')1.<).')8 


CP000034 


Shigella sonnei Ss046 


5,039,661 


CP000038 


Salmonella enterica Cholcracsius B67 


4,944,000 


AEO 17220 


Salmonella enterica Typhi Ty2 


4,791,961 


AE014613 


Salmonella enterica Typhi CT18 


5,133,713 


AL513382 


Salmonella enterica Typhimurium LT2 


4,951,371 


AE006468 


Salmonella enterica Paratyphi A ATCC9150 


4,585,229 


CP000026 


Salmonella enterica Arizonae 


4,600,800 


NC_010067 


Salmonella enterica Paratyphi B SPB7 


4,858,887 


NC_010102 



Table 1: (Supplementary) Twenty-three publicly- available, finished genome sequences from the 
genera Salmonella, Escherichia, and Shigella form our target set for multiple genome alignment. 
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Nt Sensitivity 



NtPPV 



Indel Sensitivity Indel PPV 




0.4 0.8 0.4 0.8 0.4 0.8 0.4 

Nucleotide substitutions per site among most distant taxa 



Indel Sensitivity Indel PPV 



0.8 




02468 02468 02468 024 

Small gene flux events among most distant taxa (thousands) 



— Accuracy Key , 

Percent (%) 
20 40 60 80 100 



Simulation tree topology 



Figure 8: Supplementary The performance of MLAGAN 2.0 and MAVID 2.0 on genomes evolved 
with inversions, gene gain, and gene loss corresponding to Figure |4] Neither MLAGAN nor MAVID 
were designed to align genomes with rearrangement and differential content; this figure demon- 
strates the decay in alignment quality when such forces are present in the data but not modeled 
by the aligner. The black dot in the upper right corner of the MLAGAN results indicates that 
MLAGAN did not finish alignments at the highest combination of inversion and substitution rate 
within the allotted 10 hour limit. 
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